ABSTRACT

A random sample of size N is divided into k clusters that minimize the
within-cluster sum of squares locally. This k-means clustering method can
be used as a quick procedure for constructing variable-cell histograms that
have no empty cells. A histogram estimate is proposed in this paper and is
shown to be uniformly consistent in probability.

1.   INTRODUCTION 
Let $X_1, X_2, \ldots, X_N$ be observations from some density f of a
probability distribution F. In one dimension, the traditional method of
estimating the univariate density f is the histogram. The asymptotic
properties of the fixed-cell histogram are given in Tapia and Thompson
(1978). A major difficulty of multivariate histograms, obtained by
partitioning the sampled space into cells of equal size, is that there are
too many cells with very few observations. Van Ryzin (1973) first proposed
a variable-cell histogram which is adaptive to the underlying univariate
density. Kim and Van Ryzin (1967) extended this method to the bivariate
case, but the general procedure for the multivariate case is very
complicated.

On the other hand, theoretically sound density estimation techniques
like the kernel method (Parzen, 1962) and the kth nearest neighbor method
(Loftsgaarden and Quesenberry, 1965) have computational problems when
applied to large data sets. (For the asymptotic consistency of these
techniques, see, for example, Devroye and Wagner (1977), Moore and Yackel
(1977), and Silverman (1978).) Although the statistical justification of
these density estimates requires N to be very large, the number of
computations is usually $O(N^2)$, which begins to be onerous for N over
500. In this paper, it is proposed that the widely used k-means clustering
technique can be regarded as a practical and convenient way of obtaining
variable-cell histograms in one or more dimensions; the computational
requirement of this algorithm is $O(Nk)$, where k is the number of cells
or clusters.
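
To make the cost comparison concrete, the following is a minimal sketch,
not taken from the paper, of a naive Parzen-type estimate evaluated at
every sample point. The function name, the Gaussian kernel, and the
bandwidth argument h are assumptions made for illustration. The nested
loop performs N^2 kernel evaluations, whereas one assignment pass of
k-means compares each of the N observations with k means; for N = 1000
and k = 40 that is 1,000,000 kernel evaluations against 40,000
comparisons per pass.

```python
import math


def gaussian_kde_at_sample(xs, h):
    """Naive kernel estimate at every sample point (Gaussian kernel assumed).

    Each of the N evaluation points sums over all N observations,
    so the work grows as N**2.
    """
    n = len(xs)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - y) / h) ** 2) for y in xs)
        for x in xs
    ]
```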

Suppose that the observations $x_1, \ldots, x_N$ are partitioned into k
groups with means $u_1, u_2, \ldots, u_k$ such that no movement of an
observation from one group to another will reduce the within-groups sum
of squares

$$WSS(N) = \sum_{i=1}^{N} \min_{1 \le j \le k} \| x_i - u_j \|^2 .$$
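
The within-groups sum of squares and a locally optimal partition can be
sketched as follows for scalar data. This is an illustrative Lloyd-style
iteration (alternating nearest-mean assignment and mean update). It is a
simpler stand-in for the single-observation transfer criterion described
above and is not the Hartigan and Wong (1979) algorithm. The function
names, the random initialization, and the iteration cap are assumptions.

```python
import random


def kmeans_1d(xs, k, iters=100, seed=0):
    """Lloyd-style k-means for a list of scalars; returns (means, labels)."""
    rng = random.Random(seed)
    means = sorted(rng.sample(xs, k))  # start from k sample points
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: nearest mean for every point (N * k comparisons).
        new_labels = [
            min(range(k), key=lambda j: (x - means[j]) ** 2) for x in xs
        ]
        # Update step: recompute means; an emptied cluster keeps its old mean.
        for j in range(k):
            members = [x for x, lab in zip(xs, new_labels) if lab == j]
            if members:
                means[j] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stable: a local minimum for this scheme
        labels = new_labels
    return means, labels


def wss(xs, means):
    """Within-groups sum of squares: squared distance to the nearest mean."""
    return sum(min((x - u) ** 2 for u in means) for x in xs)
```

Each pass costs on the order of Nk comparisons, which matches the
$O(Nk)$ requirement quoted earlier when the number of passes is treated
as bounded.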

This technique for dividing a sample into k clusters so as to minimize
the within-group sum of squares locally is known in the clustering
literature as k-means. In one dimension, the partition will be specified
by k-1 cutpoints; the observations lying between the same consecutive
cutpoints are in the same group. See Hartigan (1975) for a detailed
description of the k-means technique, and see Hartigan and Wong (1979)
for an efficient computational algorithm. The asymptotic properties of
k-means as a clustering procedure (as N approaches $\infty$ with k fixed)
have been studied by MacQueen (1967), Hartigan (1978), and Pollard (1981).
In this paper, it is shown that the k-means procedure can be used to
construct a histogram estimate of the underlying density function.
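
As an illustration of the one-dimensional cutpoints, the sketch below
assumes that each group is the set of points nearest to its mean (the
definition used for the intervals in Section 2), so that the k-1
cutpoints are the midpoints between adjacent sorted means. The function
names are hypothetical.

```python
import bisect


def interior_cutpoints(means):
    """Return the k-1 midpoints between adjacent sorted cluster means."""
    u = sorted(means)
    return [(u[j] + u[j + 1]) / 2.0 for j in range(len(u) - 1)]


def group_index(x, cuts):
    """Index (0-based) of the group containing x, given sorted cutpoints.

    A point lying exactly on a cutpoint is placed in the upper group.
    """
    return bisect.bisect_right(cuts, x)
```

For means 0, 2, and 6 the cutpoints are 1 and 4, so an observation at
1.5 falls in the middle group.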

In Section 2, using the asymptotic properties of k-means clusters
(when $k \to \infty$ with N) given in Wong (1980), it is shown that the
proposed histogram estimate is uniformly consistent in probability in one
dimension. The multivariate case requires further investigation, as the
generalization of the univariate consistency result to many dimensions is
not straightforward. However, empirical examples are given in Section 3 to
illustrate the potential of k-means as a practical density estimation
procedure for large multivariate data sets. Some concluding remarks are
also given in Section 3.

2.   WEAK  UNIFORM  CONSISTENCY  OF  THE  K-MEANS  HISTOGRAM  ESTIMATE 

In  this  section,  a  k-means  histogram  estimate  of  an  unknown 
univariate  density  function  is  proposed  which  is  shown  to  be 
uniformly  consistent  in  probability. 

Let $X_1, \ldots, X_N$ be observations from a density function f which
is positive and has four bounded derivatives in [a,b]. Suppose that the N
observations are grouped into $k_N$ clusters with means
$u_1 < u_2 < \cdots < u_{k_N}$ and within-cluster sums of squares
$WSS_1, \ldots, WSS_{k_N}$ such that the within-groups sum of squares

$$WSS(N) = \sum_{j=1}^{k_N} WSS_j(N) = \sum_{i=1}^{N} \min_{1 \le j \le k_N} \| x_i - u_j \|^2$$

of this locally optimal $k_N$-partition cannot be decreased by moving
any single observation from its present cluster to any other cluster.
Let $I_j = [y_{j-1}, y_j]$ be the set of points in [a,b] closer to $u_j$
than to any other cluster mean. Then $\{I_1, \ldots, I_{k_N}\}$ is the
$k_N$-partition of [a,b] defined by the cluster means
$u_1, \ldots, u_{k_N}$, and
$a = y_0 < y_1 < \cdots < y_{k_N - 1} < y_{k_N} = b$ are the cutpoints of
this partition. Denote the size of the jth cluster interval of this
partition by $e_j$ and let the number of observations in the jth cluster
be $n_j$. Then it is shown in Wong (1980) that if
$k_N = o([N/\log N]^{1/3})$,

$$\max_{1 \le j \le k_N} \left| e_j\, k_N\, f_j^{1/3} - \int_a^b f(x)^{1/3}\,dx \right| = o_p(1), \tag{2.1}$$

where $f_j$ is the density at the midpoint of the jth cluster interval,

$$\max_{1 \le j \le k_N} \left| n_j\, N^{-1} k_N\, f_j^{-2/3} - \int_a^b f(x)^{1/3}\,dx \right| = o_p(1), \tag{2.2}$$

and

$$\max_{1 \le j \le k_N} \left| 12\, WSS_j\, N^{-1} k_N^{3} - \left( \int_a^b f(x)^{1/3}\,dx \right)^{3} \right| = o_p(1). \tag{2.3}$$

From (2.1), (2.2), and (2.3) respectively, by putting
$G = \int_a^b f(x)^{1/3}\,dx$, we have, uniformly in $1 \le j \le k_N$,

$$e_j = G\, k_N^{-1} f_j^{-1/3}\,[1 + o_p(1)], \tag{2.4}$$

$$n_j = G\, N\, k_N^{-1} f_j^{2/3}\,[1 + o_p(1)], \tag{2.5}$$

and

$$WSS_j = \tfrac{1}{12}\, G^{3} N\, k_N^{-3}\,[1 + o_p(1)]. \tag{2.6}$$
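
As a plausibility check on (2.4), (2.5), and (2.6), and not as a
substitute for the proof in Wong (1980), one may treat the density as
roughly constant over a short cluster interval. This local-uniformity
approximation is an assumption introduced here for illustration only. A
uniform distribution on an interval of length $e_j$ has variance
$e_j^2/12$, so the cluster count and the within-cluster sum of squares
should behave as shown below once the interval size
$e_j \approx G k_N^{-1} f_j^{-1/3}$ is substituted.

```latex
\begin{align*}
n_j &\approx N f_j e_j
     = N f_j \cdot G k_N^{-1} f_j^{-1/3}
     = G N k_N^{-1} f_j^{2/3}, \\
WSS_j &\approx n_j \frac{e_j^{2}}{12}
     = \frac{1}{12}\, G N k_N^{-1} f_j^{2/3} \cdot G^{2} k_N^{-2} f_j^{-2/3}
     = \frac{1}{12}\, G^{3} N k_N^{-3}.
\end{align*}
```

The density $f_j$ cancels in the second line, which agrees with the
statement that the within-cluster sums of squares are asymptotically
equal across clusters.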

Therefore,  in  constructing  a  histogram  to  estimate  an  unknown 
density  function   f  which  vanishes  outside  the  finite  interval 
[a,b],   equation  (2.4)  indicates  that  the  k-means  procedure  would 
partition  [a,b]  in  such  a  way  that  the  sizes  of  the  intervals  are 
adaptive  to  the  underlying  density;  the  intervals  are  large  where 
the  density  is  low  while  the  intervals  are  small  where  the 
density  is  high.   It  follows  that  the  k-means  procedure  can  be 
regarded  as  a  useful  tool  for  constructing  variable-cell  histograms. 
Define  the  density  estimate  at  a  point  x  by 


$$f_N(x) = \frac{n_j^{3/2}}{N\,(12\, WSS_j)^{1/2}}, \qquad y_{j-1} \le x < y_j \ \text{ for } 1 \le j \le k_N. \tag{2.7}$$

Then from (2.5) and (2.6), we have

$$f_N(x) = \frac{(G N k_N^{-1} f_j^{2/3})^{3/2}\,[1 + o_p(1)]}{N\,(G^{3} N k_N^{-3})^{1/2}\,[1 + o_p(1)]} = f_j\,[1 + o_p(1)], \quad \text{uniformly in } 1 \le j \le k_N.$$

Since f is uniformly continuous,

$$\sup_{a \le x \le b} | f_N(x) - f(x) | = o_p(1).$$

We have thus shown that the histogram estimate $f_N$ is uniformly
consistent in probability.
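
The estimate (2.7), the ratio of $n_j^{3/2}$ to $N (12\,WSS_j)^{1/2}$ on
the jth cluster interval, can be computed directly from a clustered
sample. The sketch below is illustrative only: the function names are
hypothetical, the cluster labels are assumed to come from some k-means
run, and every cluster is assumed to contain at least two distinct
values so that $WSS_j > 0$.

```python
import math


def kmeans_histogram_heights(xs, labels):
    """Return sorted (cluster_mean, height) pairs for estimate (2.7).

    height_j = n_j**1.5 / (N * sqrt(12 * WSS_j)); assumes WSS_j > 0.
    """
    groups = {}
    for x, lab in zip(xs, labels):
        groups.setdefault(lab, []).append(x)
    n_total = len(xs)
    pairs = []
    for members in groups.values():
        n_j = len(members)
        u_j = sum(members) / n_j
        wss_j = sum((x - u_j) ** 2 for x in members)
        pairs.append((u_j, n_j ** 1.5 / (n_total * math.sqrt(12.0 * wss_j))))
    return sorted(pairs)


def estimate_at(x, pairs):
    """Height of the cluster whose mean is nearest to x (nearest-mean cell)."""
    return min(pairs, key=lambda p: abs(x - p[0]))[1]
```

For an evenly spaced sample on [0,1] split into four equal clusters, the
heights are all close to 1, the value of the uniform density.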

3.   EMPIRICAL  ANALYSIS  OF  THE  K-MEANS  HISTOGRAM  ESTIMATE 

The results in Section 2 indicate that the k-means procedure can be
used to construct a uniformly consistent histogram estimate of an unknown
univariate density f provided that $k_N$ is chosen such that $k_N$ is of
order $o([N/\log N]^{1/3})$. An empirical study was performed to examine
the performance of the k-means histogram estimate (see Wong 1979), in
which the effectiveness of various choices of $k_N$ was also compared.
There is empirical evidence that the choice $k = 4N^{0.3}$ is effective
over a range of sample sizes for various normal mixture densities. Two
examples are given in Figure 1 and Figure 2, in which the performance of
the k-means estimate (k=40) is illustrated by using 1000 generated
observations from two different normal mixtures. The CPU times consumed
on the IBM 370/158 for the two examples are 12.95 seconds and 14.18
seconds respectively.
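
The rule of thumb and the rate condition can be evaluated numerically.
The sketch below reads the rule as k = 4N^{0.3}; the function names and
the rounding to the nearest integer are assumptions made here, not part
of the paper.

```python
import math


def suggested_k(n_obs):
    """Rule of thumb k = 4 * N**0.3, rounded to the nearest integer."""
    return max(1, round(4 * n_obs ** 0.3))


def rate_bound(n_obs):
    """(N / log N)**(1/3), the sequence that k_N must be small-o of."""
    return (n_obs / math.log(n_obs)) ** (1.0 / 3.0)
```

For N = 1000 the rule gives 32 after rounding, while the examples in the
text use k = 40. The small-o condition is asymptotic, so it does not by
itself fix the constant for a given finite N.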

A major difficulty of the usual histogram is that when multivariate
histograms are constructed by partitioning the sampled space into cells
of equal size, there are too many empty cells. One desirable feature of
the k-means procedure is that it provides a practical and convenient way
of obtaining a k-partition of the multivariate sample, or equivalently,
the multidimensional sampled space. Consequently, histogram estimates of
the density over these k cells or regions (whose sizes are conjectured to
be adaptive to the underlying density) can be obtained. However, the
uniform consistency of such a multivariate histogram has not been
established, and much work has yet to be done to investigate the
asymptotic properties of k-means partitions of samples from
multidimensional distributions.

Empirical bivariate examples are included here to illustrate the
potential of the k-means technique as a practical density estimation
procedure for large multivariate data sets. The three generated data sets
are samples of size 1000 from (1) a bivariate normal (BVN) distribution
with mean (0,0), (2) a mixture of two BVN components with means (0,0) and
(3,3), and (3) a mixture of two BVN components with means (0,0) and
(0,6); the covariance matrices and mixing weights are not legible in this
copy. The
density estimates ($f_N(x) \propto n_i^{1+p/2}\, WSS_i^{-p/2}$; p=2 for
bivariate data) over the k=40 clusters obtained by k-means are given
respectively in Figures 3, 4, and 5. The results suggest that k-means is
a useful tool for estimating density. Moreover, the computational
requirement is only $O(Nk)$, which is considerably less prohibitive than
that of the usual kernel and nearest neighbor techniques, which require
$O(N^2)$ computations; the average CPU time on the IBM 370/158 for the
three bivariate examples is 22.32 seconds.
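
The multivariate estimate can be sketched in the same way. The code
below reads the proportionality in the text as
$n_i^{1+p/2}\, WSS_i^{-p/2}$, which reduces to the form of (2.7) up to
constants when p = 1; that reading, the function name, and the
dictionary return type are assumptions. The scores are unnormalized, so
only their ratios across clusters are meaningful.

```python
def cluster_density_scores(points, labels):
    """Unnormalised score n_i**(1 + p/2) * WSS_i**(-p/2) per cluster label.

    points: list of equal-length tuples; assumes WSS_i > 0 in every cluster.
    """
    p = len(points[0])
    groups = {}
    for pt, lab in zip(points, labels):
        groups.setdefault(lab, []).append(pt)
    scores = {}
    for lab, members in groups.items():
        n_i = len(members)
        centre = [sum(pt[d] for pt in members) / n_i for d in range(p)]
        wss_i = sum(
            (pt[d] - centre[d]) ** 2 for pt in members for d in range(p)
        )
        scores[lab] = n_i ** (1 + p / 2) * wss_i ** (-p / 2)
    return scores
```

With two clusters of four points each in the plane, one on a unit square
and one on a square of side 2, the tighter cluster receives four times
the score of the looser one, as expected for equal counts over a
quadrupled area.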